47 research outputs found

    A fault-tolerance protocol for parallel applications with communication imbalance

    Article. The predicted failure rates of future supercomputers loom over the groundbreaking research that large machines are expected to foster. Therefore, resilient extreme-scale applications are an absolute necessity to use the new generation of supercomputers effectively. Rollback-recovery techniques have traditionally been used in HPC to provide resilience. Among those techniques, message logging offers the appealing features of saving energy, accelerating recovery, and incurring a low performance penalty. Its increased memory consumption is, however, an important downside. This paper introduces memory-constrained message logging (MCML), a general framework for decreasing the memory footprint of message-logging protocols. In particular, we demonstrate the effectiveness of MCML in keeping message logging feasible for applications with substantial communication imbalance. This type of application appears in many scientific fields. We present experimental results with several parallel codes running on up to 4,096 cores. Using those results and an analytical model, we predict that MCML can reduce execution time by up to 25% and energy consumption by up to 15% at extreme scale.

    Diseño de una infraestructura de computación de alto rendimiento para objetos paralelos en un lenguaje de alto nivel

    Research project. Project code: 540213700005. Parallel computing has reached a predominant position in the last decade thanks to the abundance of multicore computer architectures. Exploiting the computational power available in modern systems offers an enormous opportunity to advance the state of the art in science and engineering. The parallel-objects programming model offers many advantages over other parallel computing models. However, this model has not been explored in the context of high-level languages. This project focused on exploring the design possibilities of a high-performance computing system for parallel objects in a high-level language. To achieve that goal, an exhaustive collection of high-performance computing tools for the Python language was compiled. That collection demonstrated the opportunity that exists in combining the two domains: parallel objects and a high-level language. In addition, the project produced an overview of the design possibilities of such a combination.

    Algoritmos alternos de bajo coste para la comparación de rutas metabólicas en plantas

    Final Report of Research and Extension Projects. Metabolic pathways provide key information for a better understanding of life and all its processes; this information is useful for improving medicine, agronomy, pharmacy, and similar areas. The main analysis tool used to study these pathways is pathway comparison using graph data structures. Graph comparison is a computationally complex task. We propose two algorithms with different approaches that simplify the problem of comparing pathways represented as graphs. The first algorithm transforms a two-dimensional graph structure into a one-dimensional structure and then aligns the corresponding data using the reduced 1D structure. The second algorithm performs a pair analysis between graphs, that is, it relates equal nodes present in both graphs, eliminates all similarities, and finally shows the remaining differences to the user. Our results show evidence of a quick, simple, and effective way to solve the described problem. The mechanism proposed in algorithm 1 can be used as a preliminary evaluator to predict good comparisons in case a deeper analysis is desired. We show that the loss of information or precision has little effect on the result, which is a similarity score between the two analyzed pathways. For algorithm 2, the proposal is to offer the expert an additional point of view for evaluating the pathway in question. In this case, no score is provided, only the listed differences.
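
    The two ideas in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the breadth-first linearization, the use of `difflib.SequenceMatcher` for the 1D alignment, the edge-set reading of "pair analysis", and the toy pathways with node labels A, B, C, D, X are all assumptions made for illustration.

```python
from collections import deque
from difflib import SequenceMatcher


def linearize(graph, start):
    """Flatten a directed graph (dict: node -> list of successors) into a
    1D node sequence. Breadth-first order is an assumed traversal."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(graph.get(node, [])):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def similarity_score(graph_a, start_a, graph_b, start_b):
    """Idea 1: align the two 1D sequences and return a score in [0, 1]."""
    seq_a = linearize(graph_a, start_a)
    seq_b = linearize(graph_b, start_b)
    return SequenceMatcher(None, seq_a, seq_b).ratio()


def edge_differences(graph_a, graph_b):
    """Idea 2: discard node pairs (edges) shared by both graphs and
    return what remains in each graph. No score is computed."""
    edges_a = {(u, v) for u, succ in graph_a.items() for v in succ}
    edges_b = {(u, v) for u, succ in graph_b.items() for v in succ}
    return edges_a - edges_b, edges_b - edges_a


# Hypothetical toy pathways; the labels carry no biological meaning.
pathway_a = {"A": ["B"], "B": ["C"], "C": ["D"]}
pathway_b = {"A": ["B"], "B": ["X"], "X": ["D"]}

print(similarity_score(pathway_a, "A", pathway_b, "A"))
only_a, only_b = edge_differences(pathway_a, pathway_b)
print(sorted(only_a), sorted(only_b))
```

    The score gives a cheap first estimate, while the edge differences give a reviewer the concrete pairs that distinguish the two pathways.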

    Using migratable objects to enhance fault tolerance schemes in supercomputers

    Supercomputers have seen an exponential increase in their size in the last two decades. Such a high growth rate is expected to take us to exascale in the 2018-2022 timeframe. However, bringing about a productive exascale environment requires focusing on several key challenges. One of those challenges is fault tolerance. Machines at extreme scale will experience frequent failures and will require the system to avoid or overcome those failures. Various techniques have recently been developed to tolerate failures. The impact of these techniques and their scalability can be substantially enhanced by a parallel programming model called migratable objects. In this paper, we demonstrate how the migratable-objects model facilitates and improves several fault tolerance approaches. Our experimental results on thousands of cores suggest that fault tolerance schemes based on migratable objects have low performance overhead and high scalability. Additionally, we present a performance model that predicts a significant benefit of using migratable objects to provide fault tolerance at extreme scale.

    Timed consistency: unifying model of consistency protocols in distributed systems

    Ordering and timeliness are two different aspects of the consistency of shared objects in distributed systems. Timed consistency [12] is an approach that considers these two elements simultaneously, according to the needs of the system. Hence, most well-known consistency protocols are candidates to be unified under the timed consistency approach, just by changing some of the time or order parameters. Red de Universidades con Carreras en Informática (RedUNCI)

    A Study of Checkpointing in Large Scale Training of Deep Neural Networks

    Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.

    Understanding soft error sensitivity of deep learning models and frameworks through checkpoint alteration

    The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advances and discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with a minimal impact on accuracy convergence. Peer reviewed. Postprint (author's final draft).
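
    As a rough illustration of what a single bit-flip does to a stored model weight, the following Python sketch flips one bit of a value encoded as an IEEE 754 float32. It is only a sketch under stated assumptions: the helper name `flip_bit` and the example weight are hypothetical, and reading or writing the actual HDF5 checkpoint file (the step the paper relies on) is deliberately left out.

```python
import struct


def flip_bit(value, bit):
    """Return `value`, encoded as an IEEE 754 float32, with one bit flipped.

    Bit 0 is the least significant mantissa bit; bit 31 is the sign bit.
    (Hypothetical helper for illustration, not taken from the paper.)
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in 0..31")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Toggle the chosen bit and reinterpret the result as a float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5  # assumed example weight, exactly representable in float32
print(flip_bit(weight, 31))  # sign bit flipped: -0.5
print(flip_bit(weight, 23))  # lowest exponent bit flipped: 1.0
print(flip_bit(weight, 0))   # lowest mantissa bit flipped: tiny change
```

    The position of the flipped bit matters: a mantissa flip barely changes the weight, while a sign or exponent flip changes it substantially, which is one reason to study flips in different places of a model.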

    Power, Reliability, Performance: One System to Rule Them All

    In a design based on the Charm++ parallel programming framework, an adaptive runtime system interacts dynamically with a data center's resource manager to control power through intelligent job scheduling, resource reassignment, and hardware reconfiguration. It simultaneously manages reliability by cooling the system to the optimal level for the running application and maintains performance through load balancing.

    Mejoramiento del modelo de la estructura interna de capas y corteza del Volcán Turrialba

    Instituto Tecnológico de Costa Rica. Escuela de Ingeniería en Computación. Final report. Code: 5402-1370-0007. Volcanic activity has an important effect on human activities and infrastructure. The recent eruptions of the Poás and Turrialba volcanoes have had an economic impact on the surrounding communities: some national parks and airports have had to close temporarily; livestock, residents, and schools have had to be relocated. Recognition of this threat motivates local authorities and the scientific community to use modern computing infrastructure to improve our understanding of volcanological phenomena. This proposal involves building an advanced computing platform to improve the model of a volcano's internal layer structure and the location of volcano-tectonic earthquakes. All this information, together with theoretical models, will offer a better understanding of the dynamics of Turrialba Volcano.

    Framework para Simulación en Paralelo de Fenómenos Sismológicos y Vulcanológicos

    Research project. Project code: 1370005. Costa Rica is located in the so-called Pacific Ring of Fire, a highly seismic zone that comprises countries on both sides of the Pacific Ocean. On average, Costa Rica experiences an earthquake of magnitude 4.0 or greater every day. It is essential for the country to have a computational platform to better understand seismological phenomena and the effect earthquakes can have on society. The main objective of this project was to identify the simulation and data-processing needs of the country's seismological observatories (OVSICORI and RSN) and to build a framework that could run those programs. The main deliverable was a first version of the framework for obtaining synthetic seismograms. A platform was designed that simulates earthquakes computationally and also associates geographic information to create videos of the earthquake with information about the physical surroundings. This integration enables an enriched visualization of the phenomena. The framework integrates several open-source tools that run on parallel architectures and can simulate a wide variety of scenarios. This kind of infrastructure is essential for the country and demonstrates the potential that exists in scientific collaboration and the use of advanced computing technologies.